Established in 2000, Transport for London (TfL) is the body responsible for the vital transport network of the United Kingdom's capital. Its role is to manage London's web of transportation systems, integrating various modes of public transport, including buses, the London Underground (Tube), London Overground (LO), trams, and more, under a single governance structure. TfL's roots trace back to earlier bodies responsible for London's transportation, beginning with London Transport, formed in 1933. The creation of TfL marked a pivotal moment in the evolution of London's transportation system, establishing a more cohesive approach to urban mobility. Currently overseeing an intricate network that spans approximately 700 miles of roads and 1,000 kilometres of cycle lanes, TfL has also revolutionised fare payment, with over 1 billion contactless journeys annually, and has contributed to the reduction of carbon emissions by introducing hybrid and electric buses. Furthermore, TfL holds global significance, as its achievements in congestion management, environmental sustainability, and accessibility have inspired cities worldwide to recalibrate their own transit ecosystems.
Many determinants influence the millions of TfL journeys that occur daily, including economic shifts, population dynamics, technological advancements, urban development, and policy interventions. By analysing the historical and contemporary factors that influence the number of journeys across TfL's various transport modes, we aim to gain key insights and a broader understanding of the future of London's transport network. Through this exploration, we unveil patterns and trends that contribute to informed decision-making in urban planning and policy making, ultimately paving the way for a more resilient and responsive transport future for London.

For our project, we have chosen to explore the number of journeys across different modes of public transport in London from April 2010 to December 2023. We have leveraged multiple datasets reliably sourced from the London Datastore, TfL and UK government websites for this project. We delve into the trends in the number of journeys over time across various transportation modes, namely buses, the London Underground, the Docklands Light Railway (DLR), London Tramlink, and the London Overground. We also explore other determinants, such as transport crime data across these modes, traffic volumes across various local authorities in London, and performance levels of the transport system. With this we can uncover the intrinsic significance public transport holds for urban safety and public welfare in London. Focusing on temporal and modal variations in the millions of journeys that occur every day, we can unearth insights into the crucial role transport networks play in the daily lives of the city's inhabitants, and predict the future number of journeys made throughout the TfL system. This in turn can aid urban planning, law enforcement, and future policy making concerning the overall well-being of London's commuters. In this notebook, we analyse past travel data, uncovering trends through Exploratory Data Analysis to answer research questions around the factors influencing these patterns. We also use regression analysis to predict the future number of journeys, ultimately contributing towards the goal of fostering a more efficient, connected and secure public transportation environment for the citizens of London.
# Uncomment the following line if the "shutup" package is not already installed on your device
#!pip install shutup
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import datetime as dt
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller
import shutup; shutup.please()
The datasets we have chosen to analyse in this project, provided in the form of Excel and CSV spreadsheets, include:
These datasets are licensed from the UK government website, ensuring the reliability and accuracy of regularly updated travel data. The website offers a user-friendly interface that makes the data easily accessible to the public, reflecting its ethical standards of transparency to its users.
The "Public Transport Journeys by Type of Transport" data provides the number of journeys, in millions, on the public transport network by type of transport, broken down by bus, Underground, DLR, tram, Overground and cable car. The data is presented as an Excel spreadsheet covering 28/04/2010 to 09/12/2023, reported in regular periods. Periods 1 and 13 vary in length, and the reported data is adjusted for these differences. Journey counts for the Docklands Light Railway are derived from automatic passenger counts at stations, while Overground and tram journeys are based on automatic on-carriage passenger counts. Additionally, reliable journey numbers for the Overground have been available only since October 2010.
The "Underground services performance – Service Operated" data provides the performance of specific Underground lines against their key performance metrics. It is presented in a CSV file, with reporting periods of four weeks each and 13 periods per financial year, from April 2018 to December 2023. This data compares the actual number of Tube trips against the scheduled number of Tube trips over time, using a predetermined set of measuring points. It is based on current working timetables, adjusted for planned closures, engineering works, impacts of three hours or more due to system issues, industrial action, force majeure, or unplanned events.
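As a minimal illustration of this metric, assuming as a simplification that it is simply operated trips over scheduled trips (the counts below are hypothetical, not from the dataset):

```python
# Hypothetical counts for one four-week reporting period (illustrative only)
scheduled_trips = 10000
operated_trips = 9120

# Percentage of service operated: actual trips vs. scheduled trips
service_operated_pct = operated_trips / scheduled_trips * 100
print(service_operated_pct)
```

Values around 90% are what we will see later when inspecting the cleaned performance series.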
The "Transport Crime in London" data provides monthly breakdowns of crime volume and the rate of crime per million passenger journeys, in an Excel spreadsheet, from 01/04/2009 to 31/03/2023. The data covers rail-related crimes on the LO, LU, DLR and Tramlink networks, handled by the British Transport Police (BTP), and the bus network, overseen by the Metropolitan Police Service (MPS). The London Overground data contains missing observations before April 2011 due to limitations in passenger journey information. TfL Rail and the DLR were introduced to the data in 2017 and 2018 respectively, and TfL Rail was later renamed the Elizabeth line in 2023.
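The rate metric can be reproduced from the volume figures; a minimal sketch, assuming the rate is simply reported crimes divided by passenger journeys expressed in millions (the figures below are hypothetical):

```python
# Hypothetical monthly figures for the bus network (illustrative only)
crime_volume = 1500          # reported crimes in the month
journeys_millions = 170.0    # passenger journeys in the month, in millions

# Crimes per million passenger journeys
crime_rate = crime_volume / journeys_millions
print(round(crime_rate, 1))  # → 8.8
```

This is why the rate, rather than the raw volume, is the fairer basis for comparing modes with very different ridership levels.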
The "Traffic Flows" data provides annual traffic volume by total distance travelled by all motor vehicles on the roads of London, in million kilometres. This data is presented in an Excel Spreadsheet, from 1993 to 2022, and broken down by London Boroughs. Some key events that affected these figures include the September 2000 fuel protest, Foot and Mouth disease in 2001 and the coronavirus (COVID-19) pandemic.
# Importing journeys data
n_journeys = pd.read_excel('data/tfl-journeys-type.xlsx', 'Journeys', parse_dates=['Period ending']).iloc[:,4:]
# Importing traffic flows data
traffic = pd.read_excel('data/traffic-flow-borough.xlsx', 'Traffic Flows - Cars')
# Importing crime data
crime_10 = pd.read_excel('data/public-transport-crime-london.xlsx', 'Volume and Rates', header = [0,1], index_col = 0, nrows = 4)
crime_11 = pd.read_excel('data/public-transport-crime-london.xlsx', 'Volume and Rates', header = [0,1], index_col = 0, nrows = 4, skiprows=7)
crime_12 = pd.read_excel('data/public-transport-crime-london.xlsx', 'Volume and Rates', header = [0,1], index_col = 0, nrows = 4, skiprows=14)
crime_13 = pd.read_excel('data/public-transport-crime-london.xlsx', 'Volume and Rates', header = [0,1], index_col = 0, nrows = 4, skiprows=21)
crime_14 = pd.read_excel('data/public-transport-crime-london.xlsx', 'Volume and Rates', header = [0,1], index_col = 0, nrows = 4, skiprows=29)
crime_15 = pd.read_excel('data/public-transport-crime-london.xlsx', 'Volume and Rates', header = [0,1], index_col = 0, nrows = 4, skiprows=37)
crime_16 = pd.read_excel('data/public-transport-crime-london.xlsx', 'Volume and Rates', header = [0,1], index_col = 0, nrows = 4, skiprows=45)
crime_17 = pd.read_excel('data/public-transport-crime-london.xlsx', 'Volume and Rates', header = [0,1], index_col = 0, nrows = 5, skiprows=53)
crime_18 = pd.read_excel('data/public-transport-crime-london.xlsx', 'Volume and Rates', header = [0,1], index_col = 0, nrows = 6, skiprows=62)
crime_19 = pd.read_excel('data/public-transport-crime-london.xlsx', 'Volume and Rates', header = [0,1], index_col = 0, nrows = 6, skiprows=72)
crime_20 = pd.read_excel('data/public-transport-crime-london.xlsx', 'Volume and Rates', header = [0,1], index_col = 0, nrows = 6, skiprows=82)
crime_21 = pd.read_excel('data/public-transport-crime-london.xlsx', 'Volume and Rates', header = [0,1], index_col = 0, nrows = 6, skiprows=92)
crime_22 = pd.read_excel('data/public-transport-crime-london.xlsx', 'Volume and Rates', header = [0,1], index_col = 0, nrows = 6, skiprows=102)
crime_23 = pd.read_excel('data/public-transport-crime-london.xlsx', 'Volume and Rates', header = [0,1], index_col = 0, nrows = 6, skiprows=112)
# Importing performance data
service_operated = pd.read_csv('data/service-operated.csv').iloc[:,:9]
Initially, we import all data into the Python notebook and carry out a cleaning process to remove missing values and duplicates. We then reshape the data into formatted tables that are easier to read, analyse and manipulate; for example, for the number-of-journeys data, the dates of reported journeys form the rows and the number of journeys across the different transport modes form the columns, as shown below.
# Changing the name of the columns
n_journeys.columns=['date','bus_n','tube_n','dlr_n','tram_n','og_n','cable_n','eliz_n']
# Setting dates as index
n_journeys.set_index('date',inplace=True)
# Aggregating data by years
n_journeys_y = n_journeys.groupby(n_journeys.index.to_period('Y').to_timestamp()).sum(min_count = 1)
# Aggregating data by calendar month, taking the mean of the periods in each month
n_journeys_m = n_journeys.groupby(n_journeys.index.to_period('M').to_timestamp()).mean()
# Aggregating data for all the means of transportation
n_journeys_total = n_journeys.agg('sum', axis=1)
n_journeys.tail()
| bus_n | tube_n | dlr_n | tram_n | og_n | cable_n | eliz_n | |
|---|---|---|---|---|---|---|---|
| date | |||||||
| 2023-08-19 | 129.710544 | 86.647955 | 7.213507 | 1.425394 | 12.375553 | 0.177132 | 15.359475 |
| 2023-09-16 | 139.594975 | 83.658393 | 7.301096 | 1.688012 | 13.837056 | 0.142462 | 15.627150 |
| 2023-10-14 | 155.412029 | 93.392539 | 8.280110 | 1.631487 | 14.841794 | 0.093054 | 17.298925 |
| 2023-11-11 | 146.374533 | 95.801186 | 7.759405 | 1.268640 | 14.983251 | 0.100782 | 17.838075 |
| 2023-12-09 | 150.650065 | 101.769430 | 7.925633 | 1.734252 | 14.982537 | 0.065571 | 17.778625 |
In the cleaning process for the number of journeys, the column names were changed to more meaningful ones, specifying the types of transport (bus, tube, DLR, tram, Overground, cable car, Elizabeth line). The 'date' column was set as the index, facilitating time-based analysis, and the data was then aggregated by year (n_journeys_y) and by month (n_journeys_m). Additionally, a new series (n_journeys_total) was created, representing the total number of journeys across all means of transportation.
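The effect of `min_count=1` in the yearly aggregation is worth spelling out: a year made up entirely of missing values stays `NaN` rather than collapsing to 0, which matters for modes (such as the Elizabeth line) that did not exist in the early years. A toy sketch with made-up values:

```python
import pandas as pd
import numpy as np

# Two early periods with no data, one later period with data (made-up values)
idx = pd.to_datetime(['2010-01-28', '2010-02-25', '2022-01-27'])
toy = pd.DataFrame({'eliz_n': [np.nan, np.nan, 15.0]}, index=idx)

# Same yearly aggregation as used above
yearly = toy.groupby(toy.index.to_period('Y').to_timestamp()).sum(min_count=1)
# 2010 has no valid observations, so it stays NaN instead of becoming 0.0
print(yearly)
```

Without `min_count=1`, the 2010 row would sum to 0.0 and spuriously suggest a year of zero journeys.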
n_journeys.describe()
| bus_n | tube_n | dlr_n | tram_n | og_n | cable_n | eliz_n | |
|---|---|---|---|---|---|---|---|
| count | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 171.000000 | 149.000000 | 112.000000 |
| mean | 159.202482 | 88.861158 | 7.599498 | 1.996925 | 11.389076 | 0.110158 | 5.295398 |
| std | 35.125796 | 23.992997 | 1.885426 | 0.458678 | 3.495306 | 0.059092 | 4.227002 |
| min | 30.223736 | 5.745632 | 1.205125 | 0.440934 | 0.999693 | 0.000169 | 0.594615 |
| 25% | 145.235034 | 84.620162 | 6.485058 | 1.688709 | 8.944797 | 0.075648 | 3.276068 |
| 50% | 173.661797 | 94.207425 | 7.739988 | 2.157100 | 11.582781 | 0.110490 | 3.768055 |
| 75% | 182.448439 | 105.077546 | 9.162448 | 2.315166 | 14.398171 | 0.132138 | 4.642658 |
| max | 207.509939 | 118.222383 | 10.636562 | 2.765871 | 17.820632 | 0.534218 | 17.838075 |
fig_box = go.Figure()
fig_box.add_trace(go.Box(y=n_journeys['tram_n'], name= 'tram_n', notched=True, boxpoints='all', fillcolor= 'lawngreen'))
fig_box.add_trace(go.Box(y=n_journeys['eliz_n'], name= 'eliz_n', notched=True, boxpoints='all', fillcolor= 'darkviolet'))
fig_box.add_trace(go.Box(y=n_journeys['dlr_n'], name= 'dlr_n', notched=True, boxpoints='all', fillcolor= 'turquoise'))
fig_box.add_trace(go.Box(y=n_journeys['og_n'], name= 'og_n', notched=True, boxpoints='all', fillcolor= 'orange'))
fig_box.add_trace(go.Box(y=n_journeys['tube_n'], name= 'tube_n', notched=True, boxpoints='all', fillcolor= 'blue'))
fig_box.add_trace(go.Box(y=n_journeys['bus_n'], name= 'bus_n', notched=True, boxpoints='all', fillcolor= 'red'))
fig_box.show()
Exploring the number of journeys across different transport types through box plots, we can clearly observe that buses see a significantly higher monthly passenger volume relative to their rail counterparts. The high median and large interquartile range suggest a positive skew in the number of bus passengers, stretching the distribution towards higher values. This can be attributed to higher accessibility, as buses tend to offer more door-to-door service than tubes or trams, especially over shorter distances. Other relevant factors include being more cost effective, providing local connectivity between neighbouring areas, and covering a more extensive area, all of which attract a diverse range of passengers.
We created an interactive boxplot with the ability to toggle the trace for each transport mode on and off via the legend. This interactive feature enables viewers to compare the categories with ease: for example, as seen in the plot, tube and bus volumes are substantially higher than those of the other modes, so one can switch off these two traces to compare the modes with significantly fewer journeys more visibly. This interactivity also lets users explore the data dynamically, surfacing additional context on specific values, such as summary statistics, as a more hands-on way of understanding the trends within the data.
traffic_lnd = traffic[traffic['Local Authority']=='London'].transpose()[2:]
traffic_lnd.columns=['traffic']
traffic_lnd.index = pd.Series(pd.date_range("1993-01-01", periods=len(traffic_lnd), freq=pd.offsets.YearBegin()))
traffic_lnd.head()
| traffic | |
|---|---|
| 1993-01-01 | 25560.0 |
| 1994-01-01 | 25851.0 |
| 1995-01-01 | 25755.0 |
| 1996-01-01 | 26009.0 |
| 1997-01-01 | 26119.0 |
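Event-driven swings such as those noted in the dataset description (the fuel protest, Foot and Mouth disease, COVID-19) can be surfaced with a year-over-year percentage change; a sketch on hypothetical figures, not the real series:

```python
import pandas as pd

# Hypothetical annual traffic figures, million vehicle-km (illustrative only)
idx = pd.to_datetime(['2019-01-01', '2020-01-01', '2021-01-01'])
toy_traffic = pd.Series([27000.0, 21600.0, 24300.0], index=idx, name='traffic')

# Year-over-year percentage change; large negative values flag disruptive events
yoy_pct = toy_traffic.pct_change() * 100
print(yoy_pct)
```

Applied to `traffic_lnd['traffic']`, the same one-liner would highlight the pandemic-era dip.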
Over the years, as TfL has shaped the way Londoners move, it has also faced the challenge of ensuring the safety and well-being of travellers. From the early days of London Transport, the city has recognised the importance of safeguarding passengers from security threats as the transport infrastructure expanded. Transport crimes range from petty offences to more serious incidents, spread across modes of transport that connect diverse neighbourhoods and urban landscapes, with policing strategies and security measures continuously evolving to balance safety with urban mobility.
lis = ['Vol','Rate']*12
# Setting the columns of the final output
crime = pd.DataFrame(columns = ['bus_vol', 'bus_rate','tube_vol', 'tube_rate', 'dlr_vol', 'dlr_rate',
'tram_vol', 'tram_rate', 'og_vol', 'og_rate', 'eliz_vol','eliz_rate'])
# Mapping from row labels in the source sheet to output column prefixes
prefixes = {'Bus': 'bus', 'London Underground': 'tube', 'Docklands Light Railway': 'dlr',
            'London Overground': 'og', 'Tfl Rail': 'eliz', 'Trams': 'tram'}
for i in range(10, 24):
    # Creating some temporary dataframes for each year
    df = globals()['crime_'+str(i)]
    if len(df) == 4:
        df.index = ['Bus', 'London Underground', 'London Overground', 'Trams']
    elif len(df) == 5:
        df.index = ['Bus', 'London Underground', 'London Overground', 'Tfl Rail', 'Trams']
    else:
        df.index = ['Bus', 'London Underground', 'Docklands Light Railway', 'London Overground', 'Tfl Rail', 'Trams']
    # Setting the columns of the output dividing per year
    globals()['table_'+str(i)] = pd.DataFrame(columns = ['bus_vol', 'bus_rate','tube_vol', 'tube_rate', 'dlr_vol',
                                                         'dlr_rate', 'tram_vol', 'tram_rate', 'og_vol', 'og_rate',
                                                         'eliz_vol','eliz_rate'])
    invert = df.transpose()
    invert.index = lis
    # Using the temporary dataframes to fill the output dataframe per year
    for mode in df.index:
        globals()['table_'+str(i)][prefixes[mode] + '_vol'] = invert.loc['Vol'][mode].to_list()
        globals()['table_'+str(i)][prefixes[mode] + '_rate'] = invert.loc['Rate'][mode].to_list()
    # Concatenating the output per year in only one dataset
    crime = pd.concat([crime,globals()['table_'+str(i)]])
# Setting the DatetimeIndex accordingly
crime.index = pd.Series(pd.date_range("2009-04-01", periods=168, freq="M"))
crime.replace('-', np.nan, inplace = True)
crime = crime.astype(float)
crime_m = crime.groupby(crime.index.to_period('M').to_timestamp()).mean()
crime_m.tail()
| bus_vol | bus_rate | tube_vol | tube_rate | dlr_vol | dlr_rate | tram_vol | tram_rate | og_vol | og_rate | eliz_vol | eliz_rate | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2022-11-01 | 1669.0 | 10.6 | 1725.0 | 18.1 | 98.0 | 12.1 | 37.0 | 20.6 | 160.0 | 10.7 | 139.0 | 8.5 |
| 2022-12-01 | 1396.0 | 10.1 | 1552.0 | 17.5 | 76.0 | 11.0 | 33.0 | 21.6 | 128.0 | 12.2 | 142.0 | 9.5 |
| 2023-01-01 | 1523.0 | 10.2 | 1687.0 | 19.1 | 67.0 | 8.7 | 41.0 | 22.6 | 123.0 | 9.2 | 108.0 | 6.9 |
| 2023-02-01 | 1436.0 | 10.2 | 1682.0 | 19.0 | 81.0 | 10.4 | 24.0 | 14.3 | 135.0 | 10.7 | 130.0 | 8.2 |
| 2023-03-01 | 1729.0 | 13.2 | 1917.0 | 24.9 | 85.0 | 12.0 | 28.0 | 18.5 | 171.0 | 14.5 | 128.0 | 9.6 |
In the cleaning process for the crime numbers, a list was defined to label the alternating volume and rate columns for each month. The final output DataFrame (crime) was created with columns representing the different transportation modes and crime metrics. A loop was used to process each yearly dataset, with temporary DataFrames (table_10 to table_23) created for each year to organise and structure the crime data. The loop involved transposing the original data, adjusting the index, and populating the output DataFrame accordingly. The resulting crime DataFrame was then given a DatetimeIndex; missing values represented as '-' were replaced with NaN and the DataFrame was converted to floats, before being aggregated by monthly mean to create crime_m. The final crime_m DataFrame provides a consolidated, organised view of crime-related data across transportation modes and metrics over the period.
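The relabelling trick at the heart of the loop can be isolated on a toy year block (two months, one mode, made-up numbers) to show how `invert.loc['Vol'][mode]` picks out every volume column at once:

```python
import pandas as pd

# Toy year block: rows are modes; columns alternate Vol/Rate per month, as in the sheet
cols = pd.MultiIndex.from_product([['Apr', 'May'], ['Vol', 'Rate']])
block = pd.DataFrame([[1500, 10.0, 1600, 11.0]], index=['Bus'], columns=cols)

invert = block.transpose()
invert.index = ['Vol', 'Rate'] * 2   # flatten the header, as with `lis` above
bus_vol = invert.loc['Vol']['Bus'].to_list()
print(bus_vol)  # the two monthly volume figures
```

Because the duplicate 'Vol' labels select every matching row at once, each `.loc` call returns a full year of one metric for one mode.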
crime_m.describe()
| bus_vol | bus_rate | tube_vol | tube_rate | dlr_vol | dlr_rate | tram_vol | tram_rate | og_vol | og_rate | eliz_vol | eliz_rate | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 168.000000 | 168.000000 | 168.000000 | 168.000000 | 72.000000 | 72.000000 | 168.000000 | 168.000000 | 168.000000 | 144.000000 | 84.000000 | 84.000000 |
| mean | 1496.053571 | 8.829991 | 1085.892857 | 12.249847 | 56.361111 | 8.128765 | 23.517857 | 10.989226 | 90.761905 | 8.907276 | 60.190476 | 14.408868 |
| std | 342.314616 | 1.862150 | 323.753640 | 6.522995 | 16.303724 | 3.778607 | 8.938227 | 4.018835 | 34.224206 | 4.894076 | 26.870271 | 5.937156 |
| min | 450.000000 | 6.100000 | 322.000000 | 6.300000 | 28.000000 | 3.200000 | 5.000000 | 2.900000 | 20.000000 | 4.300000 | 24.000000 | 6.600000 |
| 25% | 1305.750000 | 7.344459 | 895.750000 | 8.575000 | 44.000000 | 5.575000 | 17.000000 | 8.200000 | 64.000000 | 6.600000 | 38.750000 | 9.975000 |
| 50% | 1485.000000 | 8.310099 | 1046.000000 | 10.600000 | 55.500000 | 7.525941 | 22.000000 | 10.300000 | 88.000000 | 7.750000 | 53.000000 | 13.426945 |
| 75% | 1661.500000 | 10.000000 | 1209.500000 | 13.525000 | 66.250000 | 9.575000 | 29.250000 | 13.500000 | 116.250000 | 9.525000 | 74.250000 | 17.132027 |
| max | 2402.000000 | 16.900000 | 2597.000000 | 63.400000 | 107.000000 | 26.600000 | 49.000000 | 22.600000 | 171.000000 | 50.100000 | 142.000000 | 38.300000 |
fig_viol_vol = go.Figure()
fig_viol_vol.add_trace(go.Violin(y=crime['bus_vol'], name='bus_vol', line_color='red'))
fig_viol_vol.add_trace(go.Violin(y=crime['tube_vol'], name='tube_vol', line_color='blue'))
fig_viol_vol.add_trace(go.Violin(y=crime['dlr_vol'], name='dlr_vol', line_color='turquoise'))
fig_viol_vol.add_trace(go.Violin(y=crime['tram_vol'], name='tram_vol', line_color='lawngreen'))
fig_viol_vol.add_trace(go.Violin(y=crime['og_vol'], name='og_vol', line_color='orange'))
fig_viol_vol.add_trace(go.Violin(y=crime['eliz_vol'], name='eliz_vol', line_color='darkviolet'))
fig_viol_vol.update_traces(box_visible=True, points='all', jitter = 0.05, meanline_visible=True)
fig_viol_rate = go.Figure()
fig_viol_rate.add_trace(go.Violin(y=crime['bus_rate'], name='bus_rate', line_color='red'))
fig_viol_rate.add_trace(go.Violin(y=crime['tube_rate'], name='tube_rate', line_color='blue'))
fig_viol_rate.add_trace(go.Violin(y=crime['dlr_rate'], name='dlr_rate', line_color='turquoise'))
fig_viol_rate.add_trace(go.Violin(y=crime['tram_rate'], name='tram_rate', line_color='lawngreen'))
fig_viol_rate.add_trace(go.Violin(y=crime['og_rate'], name='og_rate', line_color='orange'))
fig_viol_rate.add_trace(go.Violin(y=crime['eliz_rate'], name='eliz_rate', line_color='darkviolet'))
fig_viol_rate.update_traces(box_visible=True, points='all', jitter = 0.05, meanline_visible=True)
Assessing the crime rate across transportation modes, rather than the raw crime volume, gives a much fairer comparison that is easier to visualise. These violin plots are useful for showing the distribution and probability density of the crime data across the different transport modes. Interpreting the distributions, we observe that the widths for all transport types are very similar and relatively symmetrical, suggesting similar probability densities and small skewness. Another key observation is the elongated upper tail for the tube and Overground crime rates above the violin shape, suggesting relatively extreme outliers.
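The long upper tails can be quantified rather than just eyeballed; a sketch on a hypothetical series shaped like the tube crime rate, using sample skewness and the usual 1.5×IQR upper fence:

```python
import pandas as pd

# Hypothetical crime rates with one extreme high value (illustrative only)
rates = pd.Series([8.0, 9.0, 10.0, 10.0, 11.0, 12.0, 13.0, 14.0, 60.0])

q1, q3 = rates.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = rates[rates > upper_fence]

print(rates.skew())        # positive => right-skewed, matching the elongated tail
print(outliers.to_list())  # the extreme values beyond the fence
```

Running the same two statistics on each `_rate` column of `crime` would confirm which modes carry the most extreme outliers.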
We imported data on performance, which we measure as the percentage of service operated (the number of train departures compared to the scheduled number). When cleaning the performance data, we converted the percentage strings to floats and mapped the 13 four-week financial periods per year onto 12 calendar months, for ease of manipulation.
# Selecting only the summed up general network data
service_tube = service_operated[service_operated.Line == 'Network'].reset_index(drop=True)
# Converting the percentage strings to floats
service_tube = pd.DataFrame(service_tube['Service Operated for Period - All Week'].str.rstrip('%').astype(float)/100)
service_tube.columns = ['performance']
# Setting a DatetimeIndex of 28-day financial periods as in the dataset
service_tube.index = pd.Series(pd.date_range("2018-01-01", periods=len(service_tube), freq='28D'))
# Aggregating the data monthly taking the mean to make comparison with the other datasets
service_tube_m = service_tube.groupby(service_tube.index.to_period('M').to_timestamp()).mean()
service_tube.tail()
| performance | |
|---|---|
| 2023-05-15 | 0.910 |
| 2023-06-12 | 0.907 |
| 2023-07-10 | 0.920 |
| 2023-08-07 | 0.912 |
| 2023-09-04 | 0.900 |
Data visualisation makes it easier to grasp patterns and relationships within complex numerical datasets in a clear and concise way. Using Plotly, we created line plots comparing the number of passengers and the traffic volume over time. Visualising the data temporally can help predict future trends in journeys and aid decision-making regarding transport policy in London.
# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])
# Add traces
fig.add_trace(go.Scatter(x=n_journeys.index, y=n_journeys_total, name="tfl_n"), secondary_y=False)
fig.add_trace(go.Scatter(x=traffic_lnd.index, y=traffic_lnd['traffic'], name="traffic"), secondary_y=True)
# Add figure title
fig.update_layout(title_text="Number of TfL Customers vs Traffic",
xaxis_range=[dt.date(2010,1,1), dt.date(2022,10,1)],xaxis_rangeslider_visible=True)
# Set x-axis title
fig.update_xaxes(title_text="Date")
# Set y-axes titles
fig.update_yaxes(title_text="Millions of Passengers", secondary_y=False)
fig.update_yaxes(title_text="Millions of Kilometres Travelled by Cars", secondary_y=True)
fig.show()
fig1 = go.Figure()
fig1.add_trace(go.Scatter(x=n_journeys.index, y=n_journeys['tram_n'], fill='tozeroy', line_color= 'lawngreen',
name='tram_n'))
fig1.add_trace(go.Scatter(x=n_journeys.index, y=n_journeys['eliz_n'], fill='tonexty', line_color= 'darkviolet',
name='eliz_n'))
fig1.add_trace(go.Scatter(x=n_journeys.index, y=n_journeys['dlr_n'], fill='tozeroy', line_color= 'turquoise',
name='dlr_n'))
fig1.add_trace(go.Scatter(x=n_journeys.index, y=n_journeys['og_n'], fill='tonexty', line_color= 'orange',
name='og_n'))
fig1.add_trace(go.Scatter(x=n_journeys.index, y=n_journeys['tube_n'], fill='tonexty', line_color= 'blue',
name='tube_n'))
fig1.add_trace(go.Scatter(x=n_journeys.index, y=n_journeys['bus_n'], fill='tonexty', line_color= 'red',
name='bus_n'))
fig1.add_trace(go.Scatter(x=n_journeys.index, y=n_journeys_total, fill='tonexty', line_color= 'navy',
name='tfl_n'))
# Events on London Underground
signif_dates_tube = [dt.date(2012,12,8),dt.date(2021,9,18)]
days_tube = [dt.date(2012,12,8),dt.date(2021,9,18)]
scatter_tube = n_journeys.tube_n[n_journeys.index.isin(signif_dates_tube)]
actual_text_tube= ['<br>Introduction of Contactless Payment</br>',
'<br>Northern Line Extension:</br>Kennington to Battersea']
hovertext_tube = ['<b>' + str(day) + '</b>' + actual_text_tube[i] for i, day in enumerate(days_tube)]
fig1.add_trace(go.Scatter(x=scatter_tube.index, y=scatter_tube, mode='markers',
name='events_tube', hovertext=hovertext_tube, hoverinfo="text",
marker=dict(color="red", size = [10]*3)))
# Events on London Overground
signif_dates_og = [dt.date(2011,3,5),dt.date(2012,12,8),dt.date(2015,5,30)]
days_og = [dt.date(2011,2,28),dt.date(2012,12,9),dt.date(2015,5,31)]
scatter_og = n_journeys.og_n[n_journeys.index.isin(signif_dates_og)]
actual_text_og= ['<br>OG Addition:</br>Dalston Junction to Highbury & Islington',
'<br>OG Addition:</br>Surrey Quays to Clapham Junction',
                  '<br>OG Addition:</br>Liverpool Street to Enfield Town, Cheshunt and Chingford<br>\
Romford to Upminster</br>']
hovertext_og = ['<b>' + str(day) + '</b>' + actual_text_og[i] for i, day in enumerate(days_og)]
fig1.add_trace(go.Scatter(x=scatter_og.index, y=scatter_og, mode='markers',
name='events_og', hovertext=hovertext_og, hoverinfo="text",
marker=dict(color="blue", size = [10]*3)))
# Events on DLR
signif_dates_dlr = [dt.date(2011,8,20),dt.date(2015,11,14)]
days_dlr = [dt.date(2011,8,31),dt.date(2015,11,11)]
scatter_dlr = n_journeys.dlr_n[n_journeys.index.isin(signif_dates_dlr)]
actual_text_dlr= ['<br>DLR Extension:</br>Canning Town to Stratford',
'<br>Change of fares:</br>From Zone 3 to Zone 2/3']
hovertext_dlr = ['<b>' + str(day) + '</b>' + actual_text_dlr[i] for i, day in enumerate(days_dlr)]
fig1.add_trace(go.Scatter(x=scatter_dlr.index, y=scatter_dlr, mode='markers',
name='events_dlr', hovertext=hovertext_dlr, hoverinfo="text",
marker=dict(color="green", size = [10]*3)))
# Events on Elizabeth Line
signif_dates_eliz = [dt.date(2018,5,26),dt.date(2019,12,7),dt.date(2022,5,28),dt.date(2023,5,27)]
days_eliz = [dt.date(2018,5,31),dt.date(2019,12,15),dt.date(2022,5,24),dt.date(2023,5,21)]
scatter_eliz = n_journeys.eliz_n[n_journeys.index.isin(signif_dates_eliz)]
actual_text_eliz = ['<br>TfL Rail Extension:</br>Paddington to Heathrow',
'<br>TfL Rail Extension:</br>Paddington to Reading',
'<br>TfL Rail Extension:</br>Paddington to Abbey Wood<br>Rebrand to Elizabeth Line</br>',
'<br>Elizabeth Line Extension:</br>Paddington to Shenfield<br>Reading to Abbey Wood<br>\
Heathrow to Abbey Wood</br>']
hovertext_eliz = ['<b>' + str(day) + '</b>' + actual_text_eliz[i] for i, day in enumerate(days_eliz)]
fig1.add_trace(go.Scatter(x=scatter_eliz.index, y=scatter_eliz, mode='markers',
name='events_eliz', hovertext=hovertext_eliz, hoverinfo="text",
marker=dict(color="red", size = [10]*4)))
fig1.update_layout(title_text="Number of Passengers on TfL Network", xaxis_title="Date",
yaxis_title="Millions of Passengers", xaxis_range=[dt.date(2010,5,1), dt.date(2023,12,9)],
xaxis_rangeslider_visible=True)
fig1.show()
Producing a temporal plot of the number of passengers and comparing modes of transit over the years 2010–2023, we observe the total TfL network (tfl_n), buses and the Tube seeing a far higher volume of travellers, contrasting with trams, the Overground, the DLR and the Elizabeth line (previously known as TfL Rail), which remain consistently low throughout the period. However, for all transportation modes studied, a sudden decrease in transport journeys is observed in 2020. The most significant reason for this drop is the unprecedented change in daily life and travel patterns following the global COVID-19 pandemic: during lockdowns imposed to restrict the spread of the virus, public transport usage plummeted.
To identify the significant events in this time period, the viewer can switch off the tfl_n, bus_n, tube_n and events_tube traces to see a clearer graph highlighting the temporal changes following the main events that impacted the Overground, DLR and Elizabeth line. Hovering over the event on 2015-05-31, the addition to the Overground line, we can see a sharp increase in Overground passengers from that point onwards. A similar effect followed the DLR extension on 2011-08-31. However, the opposite occurred when the DLR underwent a change in fares on 2015-11-11, causing a slight reduction in DLR passengers. We also observe a significant jump in passengers following the event on 2022-05-24, the TfL Rail extension and rebrand to the Elizabeth line.
# Mean monthly bus journeys before and after the June 2015 launch of TfL Rail
before2015_bus = n_journeys_m.bus_n[n_journeys_m.index <= '2015-06-01'].mean()
after2015_bus = n_journeys_m.bus_n[(n_journeys_m.index >= '2015-06-01') & (n_journeys_m.index < '2020-01-01')].mean()
plt.figure(figsize=(10, 6))
pre_covid_bus = n_journeys_m.bus_n[n_journeys_m.index < '2020-01-01']
plt.plot(pre_covid_bus.index, pre_covid_bus)
# Horizontal segments marking the mean of each period
x1, y1 = [dt.date(2010, 1, 1), dt.date(2015, 6, 1)], [before2015_bus, before2015_bus]
x2, y2 = [dt.date(2015, 6, 1), dt.date(2020, 1, 1)], [after2015_bus, after2015_bus]
plt.plot(x1, y1, x2, y2, marker='o')
plt.xlabel('Date')
plt.ylabel('Millions of journeys')
plt.title('Number of journeys on London Buses 2010-2020')
plt.legend(labels=['Number of journeys on buses', 'Mean before June 2015', 'Mean after June 2015'])
plt.show()
We explore the change in the number of bus journeys before and after June 2015, following the introduction of TfL Rail. This was a pivotal point for TfL, improving connectivity, speeding up journeys and offering commuters more direct routes to their destinations. As the graph shows, the mean number of bus journeys dropped, a decrease of around 8 million journeys per month on average.
Another pivotal change in the TfL network in 2015 was the restructuring of the Overground to connect more suburban rail routes, following the acquisition of lines from the Greater Anglia franchise. This shift is evident when comparing the mean of Overground journeys before and after 2015. Reasons for the increase include enhanced service frequency, more optimised routes and a generally improved passenger experience.
# Mean monthly Overground journeys before and after the start of 2015
before2015_og = n_journeys_m.og_n[n_journeys_m.index <= '2015-01-01'].mean()
after2015_og = n_journeys_m.og_n[(n_journeys_m.index >= '2015-01-01') & (n_journeys_m.index < '2020-01-01')].mean()
plt.figure(figsize=(10, 6))
pre_covid_og = n_journeys_m.og_n[n_journeys_m.index < '2020-01-01']
plt.plot(pre_covid_og.index, pre_covid_og)
# Horizontal segments marking the mean of each period
x1, y1 = [dt.date(2010, 1, 1), dt.date(2015, 1, 1)], [before2015_og, before2015_og]
x2, y2 = [dt.date(2015, 1, 1), dt.date(2020, 1, 1)], [after2015_og, after2015_og]
plt.plot(x1, y1, x2, y2, marker='o')
plt.xlabel('Date')
plt.ylabel('Millions of journeys')
plt.title('Number of journeys on the London Overground 2010-2020')
plt.legend(labels = ['Number of journeys on OG', 'Mean before beginning of 2015', 'Mean after beginning of 2015'])
plt.show()
fig2 = go.Figure()
fig2.add_trace(go.Scatter(x=crime['bus_vol'].index, y=crime['bus_vol'],
line_color= 'red', name='bus_crime', fill = 'tonexty'))
fig2.add_trace(go.Scatter(x=crime['tube_vol'].index, y=crime['tube_vol'], fill='tozeroy',
line_color= 'blue', name='tube_crime'))
fig2.update_layout(title_text="Volume of Crimes on Bus vs Tube", xaxis_title="Date",
yaxis_title="Volume of Crimes", xaxis_range=[dt.date(2009,4,30), dt.date(2023,3,31)],
xaxis_rangeslider_visible=True)
fig2.show()
Studying a temporal graph of crime volumes across the different transport modes throughout the studied period, we observe a persistent gap between Tube and bus crime: buses record a higher volume of reported incidents than the Tube. This may be due to numerous factors, one being differences in passenger demographics: bus services tend to attract a more diverse group of riders than the Tube, potentially creating conditions conducive to certain offences. Furthermore, the bus network covers a broader range of routes and neighbourhoods than the Tube network, exposing these journeys to environments with higher crime rates driven by socio-economic factors.
We decided to conduct a regression analysis on some key determinants of transport to predict the total number of journeys. Regression helps us understand the relationship between two or more variables, for example whether the average crime rate is associated with variations in the total number of journeys. Predicting journey numbers offers insights for policy making on the safety and security of passengers and commuters, and serves as a barometer of public safety within the transport network, reflecting the effectiveness of security measures and policing strategies. Regression analysis also provides statistical inference about the relationships observed, allowing us to assess their strength and significance and supporting hypothesis testing to continually improve TfL services.
# Average crime rate per month: weighted mean of per-mode crime rates,
# using each mode's crime volume as the weight
crime_m = crime_m.fillna(0)
avg_crime_rate = pd.DataFrame(index=crime_m.index, columns=['value'])
for i in range(len(crime_m)):
    avg_crime_rate.iloc[i] = np.average(
        crime_m.iloc[i][['bus_rate', 'tube_rate', 'dlr_rate',
                         'tram_rate', 'og_rate', 'eliz_rate']],
        weights=crime_m.iloc[i][['bus_vol', 'tube_vol', 'dlr_vol',
                                 'tram_vol', 'og_vol', 'eliz_vol']])
# Aggregating to monthly means so both series share the same frequency
n_journeys_total_m = n_journeys_total.groupby(
    n_journeys_total.index.to_period('M').to_timestamp()).mean()
We computed an average crime rate for each month by taking the weighted average of the registered crime rates per mode of transport, using each mode's crime volume as the weight.
# Preparing the data for the regression
data_regr = pd.concat([avg_crime_rate, n_journeys_total_m], axis = 1).dropna()
data_regr.columns = ['avg_crime_rate','n_journeys_total_m']
X = data_regr.iloc[:,0].astype(float)
y = data_regr.iloc[:,1]
X = sm.add_constant(X)
mod = sm.OLS(y,X)
fit = mod.fit()
fit.summary()
| Dep. Variable: | n_journeys_total_m | R-squared: | 0.572 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.569 |
| Method: | Least Squares | F-statistic: | 200.2 |
| Date: | Thu, 25 Jan 2024 | Prob (F-statistic): | 2.09e-29 |
| Time: | 23:24:33 | Log-Likelihood: | -775.35 |
| No. Observations: | 152 | AIC: | 1555. |
| Df Residuals: | 150 | BIC: | 1561. |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 423.4630 | 11.101 | 38.145 | 0.000 | 401.527 | 445.398 |
| avg_crime_rate | -14.8890 | 1.052 | -14.148 | 0.000 | -16.968 | -12.810 |
| Omnibus: | 12.502 | Durbin-Watson: | 0.815 |
|---|---|---|---|
| Prob(Omnibus): | 0.002 | Jarque-Bera (JB): | 28.373 |
| Skew: | 0.251 | Prob(JB): | 6.90e-07 |
| Kurtosis: | 5.056 | Cond. No. | 36.4 |
data_regr['year'] = data_regr.index.year
px.scatter(data_regr,x = 'avg_crime_rate', y = 'n_journeys_total_m',
color = 'year',trendline='ols', title = 'Average Crime Rate to Number of Journeys (in million)')
We initially run a regression of the average crime rate against the number of journeys. The analysis yields a negative coefficient of -14.89, suggesting a steep negative relationship: as the average crime rate increases by one point, the number of monthly journeys decreases by about 14.89 million. The coefficients are statistically significant, as the associated p-values are close to zero. This is also visualised in the scatter plot, where the data points follow a strongly negative line of best fit. The adjusted R², which accounts for the number of predictors in the model, is 0.569, indicating that approximately 56.9% of the variability in the total number of journeys is explained by the model. The standard errors assume that the covariance matrix of the errors is correctly specified (the covariance type is non-robust).
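As a quick illustration of what the fitted line implies, the snippet below hard-codes the intercept and slope from the OLS summary above; it is a sketch of the fitted relationship, not a production prediction pipeline.

```python
# Coefficients copied from the OLS summary above (intercept and slope)
intercept, slope = 423.4630, -14.8890

def predict_journeys(avg_crime_rate):
    """Predicted total monthly journeys (in millions) at a given average crime rate."""
    return intercept + slope * avg_crime_rate

# Each one-point rise in the average crime rate lowers the prediction
# by about 14.89 million journeys
print(predict_journeys(5.0))
print(predict_journeys(6.0) - predict_journeys(5.0))
```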
[n_journeys_tube ~ performance_tube]
data_regr = pd.concat([service_tube_m, n_journeys_m.tube_n], axis = 1).dropna()
X = data_regr['performance']
y = data_regr['tube_n']
X = sm.add_constant(X)
mod=sm.OLS(y,X)
fit = mod.fit()
fit.summary()
| Dep. Variable: | tube_n | R-squared: | 0.008 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | -0.008 |
| Method: | Least Squares | F-statistic: | 0.5009 |
| Date: | Thu, 25 Jan 2024 | Prob (F-statistic): | 0.482 |
| Time: | 23:24:34 | Log-Likelihood: | -323.24 |
| No. Observations: | 67 | AIC: | 650.5 |
| Df Residuals: | 65 | BIC: | 654.9 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 42.2292 | 50.610 | 0.834 | 0.407 | -58.845 | 143.303 |
| performance | 39.2473 | 55.453 | 0.708 | 0.482 | -71.499 | 149.994 |
| Omnibus: | 6.575 | Durbin-Watson: | 0.209 |
|---|---|---|---|
| Prob(Omnibus): | 0.037 | Jarque-Bera (JB): | 6.688 |
| Skew: | -0.740 | Prob(JB): | 0.0353 |
| Kurtosis: | 2.545 | Cond. No. | 27.2 |
data_regr['year'] = data_regr.index.year
px.scatter(data_regr,x = 'performance', y = 'tube_n', color = 'year', trendline='ols', title = 'Performance on Tube to Number of journeys (in million)')
When regressing the performance of Tube services against the number of Tube journeys, we now observe a positive coefficient of 39.25, which makes theoretical sense: higher performance should encourage more journeys. However, the p-value of this coefficient (0.482) is high, indicating that the relationship is not statistically significant. Likewise, the slightly negative adjusted R-squared shows that this predictor explains essentially none of the variance in Tube journeys. Given this uncertainty about the true parameter values, further investigation (for instance, adding more predictors or controlling for seasonality) may be necessary to improve the model's explanatory power.
A SARIMA model tests whether a time series can be described as the output of a particular stochastic process, estimated from the historical values of the series itself. In particular, SARIMA explicitly accounts for possible seasonality within a time series.
series = n_journeys_m['tube_n'].copy()
px.line(x=series.index, y=series, title="Number of Journeys (in million) on Tube over time")
adfuller(series)[1]
0.05610300261297535
The p-value of the augmented Dickey-Fuller (ADF) test (0.056) is above 0.05, suggesting that the series is not stationary and must be differenced before fitting the SARIMA model.
plot_acf(series.diff().dropna())
plot_pacf(series.diff().dropna());
After differencing the time series, the autocorrelation and partial autocorrelation are significant only at lags 6 and 12, suggesting a 6-month seasonal pattern.
We fit the SARIMA model using this information and test it on the last 12 observations.
n = len(series)
# dividing the time series into training and test sets
test = series.iloc[n-12:n]
series_train = series.iloc[0:n-12]
model = SARIMAX(series_train, order=(0,1,0), seasonal_order=(3,1,0,6))
model_fit = model.fit()
print(model_fit.summary())
residuals = pd.DataFrame(model_fit.resid)
residuals.plot()
plt.show()
# density plot of residuals
residuals.plot(kind='kde')
plt.show()
# summary stats of residuals
print(residuals.describe())
SARIMAX Results
=========================================================================================
Dep. Variable: tube_n No. Observations: 149
Model: SARIMAX(0, 1, 0)x(3, 1, 0, 6) Log Likelihood -545.076
Date: Thu, 25 Jan 2024 AIC 1098.153
Time: 23:24:34 BIC 1109.976
Sample: 0 HQIC 1102.957
- 149
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.S.L6 -0.9453 0.092 -10.241 0.000 -1.126 -0.764
ar.S.L12 -0.4982 0.100 -4.986 0.000 -0.694 -0.302
ar.S.L18 -0.3452 0.076 -4.571 0.000 -0.493 -0.197
sigma2 119.9550 7.063 16.983 0.000 106.111 133.799
===================================================================================
Ljung-Box (L1) (Q): 0.02 Jarque-Bera (JB): 302.20
Prob(Q): 0.88 Prob(JB): 0.00
Heteroskedasticity (H): 7.84 Skew: -0.87
Prob(H) (two-sided): 0.00 Kurtosis: 9.93
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
count    149.000000
mean       0.530390
std       13.505401
min      -51.661748
25%       -5.014194
50%        0.864596
75%        5.094188
max       87.531447
The Ljung-Box test returns a p-value of 0.88, excluding significant autocorrelation between lags in the residuals.
The residuals do not follow a normal distribution perfectly, because of some extreme observations, in particular during the COVID crisis. Nevertheless, predictions of future observations can be made with this model, with a good level of caution, treating them as approximations.
# Rolling forecast of the test period: refit on the growing training set
# (which excludes the last 12 observations) and predict one step ahead
prevision = []
for i in range(12):
    model = SARIMAX(series_train, order=(0,1,0), seasonal_order=(3,1,0,6))
    model_fit = model.fit(disp=False)
    output = model_fit.forecast()  # prediction for the following month
    series_train = pd.concat([series_train, output])
    prevision.append(output.iloc[0])
test = pd.DataFrame(test)
test['prevision'] = prevision
fig = go.Figure()
fig.add_trace(go.Scatter(x=test.index, y=test['tube_n'],
mode='lines',
name='Real values'))
fig.add_trace(go.Scatter(x=test.index, y=test['prevision'],
mode='lines',
name='SARIMA Predictions'))
fig.update_layout(title_text="Test of the SARIMA Model predictions on the number of journeys on Tube",
xaxis_title='Date',
yaxis_title="Number of journeys on Tube (in millions)")
The predictions made by the SARIMA model track the registered values closely. From May 2023 onwards, however, the model tends to overestimate the number of Tube journeys; for planning purposes, an overestimate is arguably less harmful than an underestimate.
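One simple way to quantify such over- or under-estimation is with standard forecast-error metrics. The arrays below are illustrative stand-ins for the `test['tube_n']` and `test['prevision']` columns, not the actual values:

```python
import numpy as np

# Toy actuals and predictions (stand-ins for the test-set columns above)
actual = np.array([85.0, 88.0, 90.0, 87.0])
predicted = np.array([86.0, 90.0, 93.0, 89.0])

mae = np.mean(np.abs(actual - predicted))              # mean absolute error
rmse = np.sqrt(np.mean((actual - predicted) ** 2))     # root mean squared error
mape = np.mean(np.abs((actual - predicted) / actual))  # mean abs. percentage error

print(mae, rmse, mape)
```

A positive mean error here would confirm systematic overestimation, complementing the visual comparison in the plot.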
We now use the same model to predict the values for the next 13 months, assuming that the network remains as it is today.
prevision = pd.Series()
for i in range(13):
model = SARIMAX(series, order=(0,1,0), seasonal_order = (3,1,0,6)) # same model already tested
model_fit = model.fit(disp=False)
output = model_fit.forecast() # prediction of the following month
series = pd.concat([series,output])
prevision = pd.concat([prevision,output])
prevision.index = pd.Series(
pd.date_range(series.index[-14], periods=13, freq="M"))
prevision_plot = pd.concat([pd.Series(series.iloc[n-1], index = [series.index[n-1]]), prevision])
fig = go.Figure()
fig.update_layout(title='Number of journeys (in millions) on Tube: from 2020, with predictions for 2024',
xaxis_title='Date',
yaxis_title='N. of journeys')
fig.add_trace(go.Scatter(x=series.iloc[120:n].index, y=series.iloc[120:n],
mode='lines',
name='Last Registered Values'))
fig.add_trace(go.Scatter(x=prevision_plot.index, y=prevision_plot,
mode='lines',
name='SARIMA Predictions'))
series = n_journeys_m['bus_n'].copy()
px.line(x=series.index, y=series, title="Number of Journeys on Bus over time")
adfuller(series)[1]
0.21675283505139398
The p-value of the ADF test (0.217) is above 0.05, suggesting that this series, too, is not stationary and must be differenced before fitting the SARIMA model.
plot_acf(series.diff().dropna())
plot_pacf(series.diff().dropna());
After differencing the time series, the autocorrelation and partial autocorrelation are significant at lags 1, 4, 5, 6, 11 and 12, suggesting a 6-month seasonal pattern with dependency on previous observations, in particular the previous month.
We fit the SARIMA model using this information and test it on the last 12 observations.
n = len(series)
test = series.iloc[n-12:n]
series_train = series.iloc[0:n-12]
model = SARIMAX(series_train, order=(1,2,1), seasonal_order = (2,1,0,6))
model_fit = model.fit()
print(model_fit.summary())
residuals = pd.DataFrame(model_fit.resid)
residuals.plot()
plt.show()
# density plot of residuals
residuals.plot(kind='kde')
plt.show()
# summary stats of residuals
print(residuals.describe())
SARIMAX Results
==========================================================================================
Dep. Variable: bus_n No. Observations: 149
Model: SARIMAX(1, 2, 1)x(2, 1, [], 6) Log Likelihood -616.938
Date: Thu, 25 Jan 2024 AIC 1243.876
Time: 23:24:37 BIC 1258.620
Sample: 0 HQIC 1249.868
- 149
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 -0.1275 0.060 -2.112 0.035 -0.246 -0.009
ma.L1 -1.0000 11.399 -0.088 0.930 -23.341 21.342
ar.S.L6 -0.9669 0.075 -12.853 0.000 -1.114 -0.819
ar.S.L12 -0.2084 0.062 -3.351 0.001 -0.330 -0.087
sigma2 336.4330 3825.746 0.088 0.930 -7161.891 7834.757
===================================================================================
Ljung-Box (L1) (Q): 0.01 Jarque-Bera (JB): 294.32
Prob(Q): 0.94 Prob(JB): 0.00
Heteroskedasticity (H): 5.38 Skew: 0.13
Prob(H) (two-sided): 0.00 Kurtosis: 10.07
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
count    149.000000
mean       0.722974
std       28.405356
min     -133.017853
25%       -7.941313
50%        0.569646
75%        7.867079
max      185.359726
As in the Tube's case, the residuals do not follow a normal distribution perfectly, because of some extreme observations, in particular during the COVID crisis.
# Rolling forecast on the test period, using the same specification
# fitted above: SARIMAX(1,2,1)x(2,1,0,6)
prevision = []
for i in range(12):
    model = SARIMAX(series_train, order=(1,2,1), seasonal_order=(2,1,0,6))
    model_fit = model.fit(disp=False)
    output = model_fit.forecast()  # prediction for the following month
    series_train = pd.concat([series_train, output])
    prevision.append(output.iloc[0])
test = pd.DataFrame(test)
test['prevision'] = prevision
fig = go.Figure()
fig.add_trace(go.Scatter(x=test.index, y=test['bus_n'],
mode='lines',
name='Real values'))
fig.add_trace(go.Scatter(x=test.index, y=test['prevision'],
mode='lines',
name='SARIMA Predictions'))
fig.update_layout(title_text="Test of the SARIMA Model predictions on number of journeys by bus", xaxis_title="Months",
yaxis_title="Number of journeys by bus (in millions)")
The SARIMA predictions follow the same trend registered over the preceding months. Here, too, the model tends to overestimate, which is still preferable to underestimation.
We now use the same model to predict the values for the next 13 months.
prevision = pd.Series()
for i in range(13):
    # same specification already tested above
    model = SARIMAX(series, order=(1,2,1), seasonal_order=(2,1,0,6))
    model_fit = model.fit(disp=False)
    output = model_fit.forecast()  # prediction of the following month
    series = pd.concat([series, output])
    prevision = pd.concat([prevision, output])
prevision.index = pd.Series(pd.date_range(series.index[-14], periods=13, freq="M"))
prevision_plot = pd.concat([pd.Series(series.iloc[n-1], index = [series.index[n-1]]), prevision])
fig = go.Figure()
fig.update_layout(title='Number of journeys by bus: from 2020, with predictions for 2024',
xaxis_title='Months',
yaxis_title='N. of journeys')
fig.add_trace(go.Scatter(x=series.iloc[120:n].index, y=series.iloc[120:n],
mode='lines',
name='Last Registered Values'))
fig.add_trace(go.Scatter(x=prevision_plot.index, y=prevision_plot,
mode='lines',
name='SARIMA Predictions'))
In conclusion, our analysis of transport in London through data-driven exploration of TfL’s history, data visualisation integrating different transport modes, and regression and SARIMA modelling, has provided valuable insights into the complex dynamics of London's transport network. The project has focused on understanding the factors influencing the number of journeys across TfL's diverse transport modes. Analysing historical and contemporary datasets, exploring patterns associated with public transport usage, crime rates, and performance metrics, we have been able to answer some important research questions surrounding impacts of significant events, determinants influencing journey numbers, changes in transport-related crimes, and the correlation between performance and journeys.
Our exploration revealed the profound impact of the COVID-19 pandemic on transport patterns, with a significant drop in journeys in 2020 due to lockdowns and reduced public transport usage. Crime rates also experienced a spike during this period, likely influenced by shifts in law enforcement priorities and economic uncertainties. Regression analysis further illuminated the relationships between average crime rates, traffic flows, and tube performance with the total number of journeys. The negative relationship between average crime rates and journeys suggested that higher crime rates are associated with fewer journeys, emphasizing the importance of security in promoting public transport use.
Employing the SARIMA model, we predicted future values of the time series, focusing on the number of Tube journeys. The ADF test indicated non-stationarity, requiring differencing before fitting. Significant autocorrelation and partial autocorrelation at lags 6 and 12 suggested a 6-month seasonal pattern. The Ljung-Box test excluded significant autocorrelation between lags in the residuals, which nonetheless deviate from a normal distribution due to extreme observations during the COVID crisis. The SARIMA predictions track the registered values but tend to overestimate, which is preferable for planning. Applying the model to the 13 months following December 2023 highlights an ongoing seasonal trend, with a slightly increasing pattern continuing from the last months of 2023.
As this comprehensive analysis contributes to informed decision-making in urban planning, we conclude that, to accommodate the projected seasonal increase in journeys across the TfL network, enhanced law enforcement and careful policy formulation will be required for the future of London's transport network. By understanding the intricate connections between the various factors influencing journeys, TfL can continue its mission of providing a responsive and adaptive transportation network for the well-being of London's commuters.